
    Stronger Baselines for Trustable Results in Neural Machine Translation

    Interest in neural machine translation has grown rapidly as its effectiveness has been demonstrated across language and data scenarios. New research regularly introduces architectural and algorithmic improvements that lead to significant gains over "vanilla" NMT implementations. However, these new techniques are rarely evaluated in the context of previously published techniques, specifically those that are widely used in state-of-the-art production and shared-task systems. As a result, it is often difficult to determine whether improvements from research will carry over to systems deployed for real-world use. In this work, we recommend three specific methods that are relatively easy to implement and result in much stronger experimental systems. Beyond reporting significantly higher BLEU scores, we conduct an in-depth analysis of where improvements originate and what inherent weaknesses of basic NMT models are being addressed. We then compare the relative gains afforded by several other techniques proposed in the literature when starting with vanilla systems versus our stronger baselines, showing that experimental conclusions may change depending on the baseline chosen. This indicates that choosing a strong baseline is crucial for reporting reliable experimental results. Comment: To appear at the Workshop on Neural Machine Translation (WNMT).
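
    As a rough illustration of the kind of comparison described in this abstract, the sketch below scores a hypothetical baseline system and a stronger system with corpus-level BLEU using the sacrebleu library. The file names and data are placeholders, and this is not the authors' experimental setup.

```python
# Minimal sketch: comparing a baseline and a "stronger" system on corpus BLEU.
# Assumes sacrebleu is installed (pip install sacrebleu); file names are placeholders.
import sacrebleu

def read_lines(path):
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]

refs = read_lines("test.ref")               # one reference translation per line
baseline_hyp = read_lines("baseline.hyp")   # vanilla NMT output
stronger_hyp = read_lines("stronger.hyp")   # output after the recommended techniques

baseline_bleu = sacrebleu.corpus_bleu(baseline_hyp, [refs])
stronger_bleu = sacrebleu.corpus_bleu(stronger_hyp, [refs])

print(f"baseline BLEU: {baseline_bleu.score:.2f}")
print(f"stronger BLEU: {stronger_bleu.score:.2f}")
print(f"delta:         {stronger_bleu.score - baseline_bleu.score:+.2f}")
```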

    Evaluating MT systems with BEER

    We present BEER, an open source implementation of a machine translation evaluation metric. BEER is a metric trained for high correlation with human ranking by using learning-to-rank training methods. For evaluation of lexical accuracy it uses sub-word units (character n-grams), while for measuring word order it uses hierarchical representations based on PETs (permutation trees). During the last WMT metrics tasks, BEER has shown high correlation with human judgments both on the sentence and the corpus levels. In this paper we show how BEER can be used for (i) full evaluation of MT output, (ii) isolated evaluation of word order and (iii) tuning MT systems.
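
    BEER's lexical component is built on sub-word units (character n-grams). The toy sketch below computes a character n-gram F-score between a hypothesis and a reference to illustrate that kind of sub-word signal; it is an illustration only, not BEER's actual feature set, training procedure, or API.

```python
# Toy sketch of character n-gram matching, the kind of sub-word lexical signal
# BEER combines (via learning-to-rank) with word-order features. Illustration
# only; not the actual BEER features or training procedure.
from collections import Counter

def char_ngrams(text, n):
    """Return a multiset of character n-grams (spaces kept as word boundaries)."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def char_ngram_f1(hypothesis, reference, n=4):
    """Simple F1 over character n-gram overlap between hypothesis and reference."""
    hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
    overlap = sum((hyp & ref).values())
    if not overlap:
        return 0.0
    precision = overlap / sum(hyp.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

print(char_ngram_f1("the cat sat on the mat", "a cat sat on the mat"))
```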

    Taking MT evaluation metrics to extremes: beyond correlation with human judgments

    Automatic Machine Translation (MT) evaluation is an active field of research, with a handful of new metrics devised every year. Evaluation metrics are generally benchmarked against manual assessment of translation quality, with performance measured in terms of overall correlation with human scores. Much work has been dedicated to the improvement of evaluation metrics to achieve a higher correlation with human judgments. However, little insight has been provided regarding the weaknesses and strengths of existing approaches and their behavior in different settings. In this work we conduct a broad meta-evaluation study of the performance of a wide range of evaluation metrics focusing on three major aspects. First, we analyze the performance of the metrics when faced with different levels of translation quality, proposing a local dependency measure as an alternative to the standard, global correlation coefficient. We show that metric performance varies significantly across different levels of MT quality: metrics perform poorly when faced with low-quality translations and are not able to capture nuanced quality distinctions. Interestingly, we show that evaluating low-quality translations is also more challenging for humans. Second, we show that metrics are more reliable when evaluating neural MT than traditional statistical MT systems. Finally, we show that the difference in the evaluation accuracy for different metrics is maintained even if the gold standard scores are based on different criteria.
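
    The contrast between a single global correlation and behavior at specific quality levels can be illustrated with a small sketch: below, metric–human correlation is computed separately within quality bins as well as globally. The data are synthetic placeholders, and the binned correlation is only a simplified stand-in for the paper's local dependency measure.

```python
# Sketch: instead of one global correlation between metric and human scores,
# compute the correlation separately within bins of human-judged quality.
# Synthetic data; illustrates the general idea, not the paper's exact measure.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
human = rng.uniform(0, 100, size=1000)           # placeholder human scores
metric = human + rng.normal(0, 15, size=1000)    # placeholder metric scores

bins = [(0, 33), (33, 66), (66, 100)]            # low / mid / high quality
for lo, hi in bins:
    mask = (human >= lo) & (human < hi)
    r, _ = pearsonr(human[mask], metric[mask])
    print(f"quality {lo:>2}-{hi:<3}: n={mask.sum():4d}  Pearson r={r:.3f}")

r_global, _ = pearsonr(human, metric)
print(f"global Pearson r={r_global:.3f}")
```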

    Machine Translation for Human Translators

    While machine translation is sometimes sufficient for conveying information across language barriers, many scenarios still require precise human-quality translation that MT is currently unable to deliver. Governments and international organizations such as the United Nations require accurate translations of content dealing with complex geopolitical issues. Community-driven projects such as Wikipedia rely on volunteer translators to bring accurate information to diverse language communities. As the amount of data requiring translation has continued to increase, the idea of using machine translation to improve the speed of human translation has gained interest. In the frequently employed practice of post-editing, a machine translation system outputs an initial translation and a human translator edits it for correctness, ideally saving time over translating from scratch. While general improvements in MT quality have led to productivity gains with this technique, there has been little work on designing translation systems specifically for post-editing. In this work, we propose improvements to key components of statistical machine translation systems aimed at directly reducing the amount of work required from human translators. We propose casting MT for post-editing as an online learning task where new training instances are created as humans edit system output, introducing an online translation model that immediately learns from post-editor feedback. We propose an extended translation feature set that allows this model to learn from multiple translation contexts.
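
    The online-learning framing can be sketched as a simple feedback loop: each post-edited sentence becomes a new training instance and the model is updated immediately. The feature extractor and perceptron-style update below are illustrative assumptions, not the authors' actual translation model or feature set.

```python
# Toy sketch of MT-for-post-editing as online learning: after each sentence the
# post-editor's correction becomes a training instance and the feature weights
# are updated immediately. Features and update rule are illustrative assumptions.
def features(source, translation):
    """Hypothetical feature extractor for a (source, translation) pair."""
    return {
        "len_ratio": len(translation.split()) / max(len(source.split()), 1),
        "bias": 1.0,
    }

def online_post_editing_loop(weights, stream, learning_rate=0.1):
    """Perceptron-style updates driven by post-editor feedback."""
    for source, system_output, post_edit in stream:
        good = features(source, post_edit)      # human-corrected translation
        bad = features(source, system_output)   # original system output
        # Move weights toward the post-edited (preferred) translation.
        for name in set(good) | set(bad):
            weights[name] = weights.get(name, 0.0) + learning_rate * (
                good.get(name, 0.0) - bad.get(name, 0.0)
            )
    return weights

stream = [("das ist gut", "that is well", "that is good")]
print(online_post_editing_loop({}, stream))
```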

    Meteor Universal: Language Specific Translation Evaluation for Any Target Language

    This paper describes Meteor Universal, released for the 2014 ACL Workshop on Statistical Machine Translation. Meteor Universal brings language specific evaluation to previously unsupported target languages by (1) automatically extracting linguistic resources (paraphrase tables and function word lists) from the bitext used to train MT systems and (2) using a universal parameter set learned from pooling human judgments of translation quality from several language directions. Meteor Universal is shown to significantly outperform baseline BLEU on two new languages, Russian (WMT13) and Hindi (WMT14).
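
    One of the automatically extracted resources is a target-language function word list. A common heuristic, sketched below, is to take the most frequent target-side tokens from the bitext; the exact procedure in the released tool may differ.

```python
# Sketch of one resource Meteor Universal derives from the bitext: a target-side
# function word list, here approximated as the most frequent tokens. This is a
# common heuristic, not necessarily the released tool's exact procedure.
from collections import Counter

def extract_function_words(target_sentences, top_k=100):
    """Return the top_k most frequent target-side tokens as a function word list."""
    counts = Counter()
    for sentence in target_sentences:
        counts.update(sentence.lower().split())
    return [word for word, _ in counts.most_common(top_k)]

target_side = [
    "the cat is on the mat",
    "a dog is in the house",
]
print(extract_function_words(target_side, top_k=5))
```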